Text Segmentation as a Supervised Learning Task
نویسندگان
چکیده
Text segmentation, the task of dividing a document into contiguous segments based on its semantic structure, is a longstanding challenge in language understanding. Previous work on text segmentation focused on unsupervised methods such as clustering or graph search, due to the paucity in labeled data. In this work, we formulate text segmentation as a supervised learning problem, and present a large new dataset for text segmentation that is automatically extracted and labeled from Wikipedia. Moreover, we develop a segmentation model based on this dataset and show that it generalizes well to unseen natural text.
منابع مشابه
Topic Segmentation
This chapter discusses the task of topic segmentation: automatically dividing single long recordings or transcripts into shorter, topically coherent segments. First, we look at the task itself, the applications which require it, and some ways to evaluate accuracy. We then explain the most influential approaches – generative and discriminative, supervised and unsupervised – and discuss their app...
متن کاملA domain adaption Word Segmenter For Sighan Backoff 2010
We present a Chinese word segmentation system which ran on the closed track of the simplified Chinese Word Segmentation task of CIPS-SIGHAN-CLP 2010 bakeoffs. Our segmenter was built using a HMM. To fulfill the cross-domain segmentation task, we use semi-supervised machine learning method to get the HMM model. Finally we get the mean result of four domains: P=0.719, R=0.72
متن کاملAn Iterative Approach to Text Segmentation
We present divSeg, a novel method for text segmentation that iteratively splits a portion of text at its weakest point in terms of the connectivity strength between two adjacent parts. To search for the weakest point, we apply two different measures: one is based on language modeling of text segmentation and the other, on the interconnectivity between two segments. Our solution produces a deep ...
متن کاملIncorporating Global Information into Supervised Learning for Chinese Word Segmentation
This paper presents a novel approach to Chinese word segmentation (CWS) that attempts to utilize global information (GI) such as co-occurrence of sub-sequences and outputs of unsupervised segmentation in the whole text for further enhancement of the state-of-the-art performance of conditional random fields (CRF) learning. In the existing work of CWS, supervised and unsupervised learning seldom ...
متن کاملEmotion Detection in Persian Text; A Machine Learning Model
This study aimed to develop a computational model for recognition of emotion in Persian text as a supervised machine learning problem. We considered Pluthchik emotion model as supervised learning criteria and Support Vector Machine (SVM) as baseline classifier. We also used NRC lexicon and contextual features as training data and components of the model. One hundred selected texts including pol...
متن کامل